Introduction

The goal of this document is to provide a high-level view of all of the major recommendation system algorithms available.

Specifically, for each algorithm:

A lot of the content in this document is from the book Recommender Systems: The Textbook (Aggarwal 2016) (a phenomenal resource), although you can also see a list of the other references here.

Although I use

\[r_{ijc}=\text{rating of item } j \text{ by user } i \text{ in context } c\]

as the outcome of interest throughout this document, this can be more generally understood as

\[\text{user } i \text{'s affinity for item } j \text{ in context } c\]

The exact outcome of interest will depend on the particular recommendation domain (e.g. the outcome of interest might be ‘probability of click’ in a web recommendation context, or it might be ‘explicit user rating of video X’ in a media streaming context).

I provide code implementations for some of the recommendation algorithms that I describe. Most of these scripts use this python data simulation class to generate synthetic data.

Here are the current model code implementations available:

(although these scripts run from end to end, they are not intended to be used as is in a production process. They are intended to serve more as a high-level blueprint - an illustration of the algorithm components and of how they work together)

Section Section Status
Introduction partially completed
code: data simulator working but incomplete documentation
code: embeddings in TensorFlow needs final edit
code: embeddings in PyTorch needs final edit
code: arules item/item recommender needs final edit
code: matrix factorization TensorFlow needs final edit
code: matrix factorization PyTorch incomplete (get complete code from jupyter notebook)
code: graph-based collaborative filtering needs final edit
code: Deep & Cross model (PyTorch) working but incomplete documentation
code: 2-Tower Model (TensorFlow) needs final edit
Goals of Recommender Systems COMPLETED
User Embeddings & Item Embeddings COMPLETED
Design Patterns COMPLETED
Collaborative Filtering needs final edit
Collaborative Filtering: Neighbourhood: User-User Similarity needs final edit
Collaborative Filtering: Neighbourhood: Item-Item Similarity needs final edit
Collaborative Filtering: Neighbourhood: Combining User-User and Item-Item Similarity needs final edit
Collaborative Filtering: Matrix Factorization needs final edit
Collaborative Filtering: Neighbourhood: Graph-Based needs final edit
Collaborative Filtering: Naïve Bayes needs final edit
Content-Based Recommendation needs final edit
Content-Based Recommendation: Raw Text Preprocessing needs final edit
Raw Text Preprocessing python function need to add my existing python script (and document)
Creating User & Item Characterization Vectors still TODO
Supervised Learning needs final edit
Multi-Armed Bandits needs final edit
Vincent’s Lemma: Serendipity needs final edit
Association Rules-Based Recommendation needs final edit
Sequential Pattern Mining still TODO
Clustering-Based Recommendation still TODO
Graph-Based Collaborative Filtering needs final edit
Matrix Factorization (Latent Factor Models) needs final edit
Naïve Bayes Collaborative Filtering needs final edit
Knowledge-Based Recommendation needs final edit
Knowledge-Based Recommendation: Constraint-Based needs final edit
Knowledge-Based Recommendation: Case-Based needs final edit
Hybrid Systems needs final edit
Graph Neural Networks (GNNs) needs final edit
Tradeoffs Between Various Recommendation Algorithms partially completed
Factorization Machines needs final edit
Incorporating Context needs final edit
Incorporating Context: Contextual Pre-Filtering needs final edit
Incorporating Context: Contextual Post-Filtering needs final edit
Incorporating Context: Contextual Modelling needs final edit
Incorporating Context: Contextual Modelling: Contextual Latent Factor Models partially completed
Incorporating Context: Contextual Modelling: Contextual Neighbourhood-Based Models still TODO
Session-Based Recommendation still TODO
Wide & Deep Model needs final edit
Deep & Cross Model needs final edit
Two Tower Model needs final edit
Integrating Latent Factor Models with Arbitrary Models partially completed










Goals of Recommender Systems


  1. Relevance: The primary goal of standard recommender systems is to highlight, for each user, the subset of items which are most relevant to them. In other words, to highlight to each user the items which they would be most interested in (the items with the most utility to them).

However, there are also secondary goals that are very important in many cases:

  1. Novelty: Recommended items should be ones that a user has not seen before (or ones that the user could not easily find on their own).

  2. Serendipity: Item recommendations should sometimes be unexpected (pleasantly surprising) to the user. Serendipity is “finding valuable or pleasant things that are not looked for” (Kaminskas and Bridge 2016).

“New” and “Surprising” are not the same thing. Image source: author

  3. Diversity: “In information retrieval.. [covering] a broad area of the information space increases the chance of satisfying the user’s information need” (Kaminskas and Bridge 2016). This is because a user’s intent is often ambiguous (e.g. whether “bat” refers to an animal or to a piece of sporting equipment), and returning a diverse result set makes it more likely that what the user is looking for is in it. In other words, it’s often a good idea to hedge your bets. This concept is similarly applicable to recommendation systems: since one can never be sure exactly what is most relevant to a particular user, it is safer to recommend a diverse set of options to them.

  4. Coverage: “Coverage reflects the degree to which the generated recommendations cover the catalogue of available items” (Kaminskas and Bridge 2016). This is important both for users (since it improves the usefulness/depth of the system) and for business-owners (because showing users the same small subset of the item catalogue might impact stock level management of physical items, and also because there has been a general societal shift in consumer demand for products in the long tail of the catalogue (Armstrong 2008)).

  5. Non-Offensive: Certain items (or specific item combinations) can be worse than irrelevant for a particular user - a recommendation might actually offend them. An example is an item recommendation (or combination of items) which perpetuates a racial stereotype. It can be very important to identify these offensive user/item combinations since a single offensive recommendation can result in the permanent loss of a user.

  6. Responsibility/Compliance: It is sometimes irresponsible (or illegal) to recommend certain user/item combinations (e.g. recommending alcohol to a recovering alcoholic, or guns to a minor).

  7. Long-Term Engagement: In many recommendation domains, the primary business goal is to grow a base of engaged long-term (and returning) users. However, it is often quite difficult to objectively measure performance on a long-term objective like this, and more measurable short-term proxies tend to be monitored instead. Optimising for a short-term proxy objective (such as click rate) can sometimes actually be detrimental to a true long-term objective (such as % of users who return). An example of this is the use of overly sensationalistic or purposefully controversial content in a news portal - this is likely to draw a lot of short-term attention, but users are unlikely to return. Good item recommendations should promote long-term user engagement with the system.

  8. Perception of System Intelligence: A single bad recommendation (an obvious mistake) can ruin a set of otherwise perfect recommendations. This is because users tend to be naturally distrustful of automated systems (sometimes even actively seeking out their flaws in order to validate their skepticism). It can sometimes be more important to ensure that the weakest recommendation in a set is not too bad than it is to ensure that the strongest recommendations in it are very good.

There is a good discussion of the topic (the multiple goals of recommender systems), and of how these outcomes can be optimized and objectively measured, in the paper Diversity, Serendipity, Novelty, and Coverage: A Survey and Empirical Analysis of Beyond-Accuracy Objectives in Recommender Systems (Kaminskas and Bridge 2016).
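As a small illustration of how one of these secondary goals can be measured, here is a minimal sketch of a catalogue coverage calculation: the share of the catalogue that appears in at least one user's recommendation list. The function name `catalogue_coverage` and the toy data are assumptions made for this example, and other definitions of coverage exist.

```python
def catalogue_coverage(recommendations, catalogue):
    """Fraction of catalogue items recommended to at least one user."""
    catalogue = set(catalogue)
    recommended = set().union(*recommendations.values()) & catalogue
    return len(recommended) / len(catalogue)

# Hypothetical recommendation lists for three users over a six-item catalogue.
catalogue = ["A", "B", "C", "D", "E", "F"]
recommendations = {
    "user_1": ["A", "B"],
    "user_2": ["A", "C"],
    "user_3": ["B", "A"],
}

coverage = catalogue_coverage(recommendations, catalogue)
print(coverage)  # 3 of the 6 items are ever recommended
```

A low value here suggests that the system keeps showing the same small subset of items to everyone.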

Some further notes:










User Embeddings & Item Embeddings


The embedding is a powerful tool, included as a component in many different models. It is simply a real-valued vector which encodes useful information about an entity, optimized for use in a specific task.

For example: in a recommendation context, we might be able to learn an informative 5-dimensional numeric vector representation (embedding) for each of our users, such as

\[\text{embedding of user ID 69420}:\quad\big[0.235,-0.981,99.64,-1.4,-4.1\big]\]

We can then directly use this embedding in various downstream applications, such as:

An embedding can also be interpreted as a representation of an object in a latent space, meaning a vector space in which the position of an embedding vector in the space provides rich information about the object (not necessarily meaningful to a human, but definitely a highly predictive feature to include in a model).

Embeddings are especially useful in a recommendation context, since they are very effective at turning high-dimensional sparse vectors into dense low-dimensional ones (which are much easier to use in a model, and good for combatting overfitting).
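To make the sparse-to-dense idea concrete, here is a minimal sketch in plain Python (it is not the TensorFlow or PyTorch code referred to in this document). It treats the embedding layer as a lookup table: multiplying a one-hot user vector by the embedding matrix simply selects one dense row. The names `embedding_table`, `one_hot` and `embed`, the sizes, and the random initialisation are all assumptions for illustration; in a real model the table values would be learned.

```python
import random

random.seed(0)

n_users, embedding_dim = 1000, 5

# One dense row per user index (random here; learned during training in practice).
embedding_table = [
    [random.gauss(0.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(n_users)
]

def one_hot(user_index, size):
    """Sparse high-dimensional representation: all zeros except a single 1."""
    vec = [0.0] * size
    vec[user_index] = 1.0
    return vec

def embed(user_index):
    """Dense low-dimensional representation: a direct row lookup."""
    return embedding_table[user_index]

def one_hot_times_table(user_index):
    """The same vector computed the slow way, as one_hot @ embedding_table."""
    x = one_hot(user_index, n_users)
    return [
        sum(x[u] * embedding_table[u][d] for u in range(n_users))
        for d in range(embedding_dim)
    ]

sparse = one_hot(42, n_users)  # 1000 numbers, 999 of them zero
dense = embed(42)              # 5 numbers
```

The lookup and the matrix product give the same vector, which is why frameworks implement embeddings as an indexed table rather than as an explicit multiplication.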

The choice of the dimension of the embeddings, and the algorithm used to learn them (which can be supervised or unsupervised), are both hyperparameters to be selected or optimized, and are definitely highly problem specific.

Here is a (definitely non-exhaustive) list of methods for creating embeddings:

Here is python code showing how embeddings are included in a TensorFlow model

Here is python code showing how embeddings are included in a PyTorch model










Design Patterns


Here are some examples of general recommendation system strategies:










Collaborative Filtering


Ratings Matrices. Image source: author

Collaborative Filtering refers, in general, to the process of inferring the unobserved preferences of entities (e.g. users) using the observed preferences of other entities (e.g. other users). It is “collaborative” in the sense that entities (unwittingly) contribute information toward each other’s recommendations.

Collaborative Filtering: Neighbourhood: User-User Similarity

image source: author

  • Predict an unobserved rating for item \(j\) by user \(i\) as the average rating of item \(j\) across the \(k\) users most similar to user \(i\) who have rated item \(j\). This average can (if desired) be weighted proportionately to each user’s similarity to user \(i\).

  • Similarity between users is traditionally defined using the distance between their item rating vectors (i.e. their rows in the user/item matrix). Vector distance metrics such as cosine similarity, euclidean distance, manhattan distance etc. can be used for this.

  • ‘Closest \(k\) neighbours’ can (if desired) be replaced by ‘all neighbours within a distance of d’.

  • Compared to Item-Item Similarity, User-User Similarity tends to produce more serendipitous - if sometimes less relevant - recommendations (see also Combining User-User & Item-Item Similarity).
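The procedure in the bullets above can be sketched as follows. This is a minimal illustration only: the toy `ratings` dictionary, the function names, and the choice to compute cosine similarity over co-rated items on raw (not mean-centred) ratings are all assumptions made for this example.

```python
import math

# Toy ratings: user -> {item: rating}. Missing keys are unobserved ratings.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 5, "B": 3, "C": 4, "D": 1},
    "carol": {"A": 1, "B": 5, "D": 5},
    "dave":  {"A": 4, "B": 2, "C": 5, "D": 2},
}

def cosine_similarity(u, v):
    """Cosine similarity over the items that both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[j] * v[j] for j in common)
    norm_u = math.sqrt(sum(u[j] ** 2 for j in common))
    norm_v = math.sqrt(sum(v[j] ** 2 for j in common))
    return dot / (norm_u * norm_v)

def predict_user_user(ratings, user, item, k=2):
    """Similarity-weighted mean rating of `item` over the k most similar raters."""
    candidates = [
        (cosine_similarity(ratings[user], other_ratings), other_ratings[item])
        for other, other_ratings in ratings.items()
        if other != user and item in other_ratings
    ]
    top_k = sorted(candidates, reverse=True)[:k]
    total_weight = sum(sim for sim, _ in top_k)
    if total_weight == 0:
        return None  # no usable neighbours
    return sum(sim * rating for sim, rating in top_k) / total_weight

# Alice has not rated item D; her two nearest neighbours here are Bob and Dave.
prediction = predict_user_user(ratings, "alice", "D", k=2)
print(round(prediction, 2))
```

In practice ratings are often mean-centred per user before computing similarities, since raw cosine similarity on all-positive ratings tends to make every pair of users look alike.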

Collaborative Filtering: Neighbourhood: Item-Item Similarity

image source: author

  • Item-Item collaborative filtering is mathematically identical to User-User, except that the calculation is performed on columns of the user/item rating matrix rather than rows

    i.e. an unobserved rating for item \(j\) by user \(i\) is estimated as the average rating by user \(i\) over the \(k\) most similar items to item \(j\) that user \(i\) has rated.

  • Compared to User-User Similarity, Item-Item Similarity tends to produce more relevant - if sometimes more boring/obvious - recommendations (see also Combining User-User & Item-Item Similarity).
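The item-item calculation described above can be sketched in the same way, working on columns of the ratings matrix. Again this is only an illustration: the toy data, the function names and the use of raw cosine similarity over co-rating users are assumptions for this example.

```python
import math

# Toy ratings: user -> {item: rating}. Missing keys are unobserved ratings.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 5, "B": 3, "C": 4, "D": 1},
    "carol": {"A": 1, "B": 5, "D": 5},
    "dave":  {"A": 4, "B": 2, "C": 5, "D": 2},
}

def item_vector(ratings, item):
    """Column of the ratings matrix: user -> rating for a single item."""
    return {user: r[item] for user, r in ratings.items() if item in r}

def cosine_similarity(a, b):
    """Cosine similarity over the users who have rated both items."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = math.sqrt(sum(a[u] ** 2 for u in common))
    norm_b = math.sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)

def predict_item_item(ratings, user, item, k=2):
    """Similarity-weighted mean of the user's own ratings on the k most similar items."""
    target = item_vector(ratings, item)
    candidates = [
        (cosine_similarity(target, item_vector(ratings, other)), rating)
        for other, rating in ratings[user].items()
        if other != item
    ]
    top_k = sorted(candidates, reverse=True)[:k]
    total_weight = sum(sim for sim, _ in top_k)
    if total_weight == 0:
        return None  # no usable neighbouring items
    return sum(sim * rating for sim, rating in top_k) / total_weight

# Alice's predicted rating for item D, based on her own ratings of similar items.
prediction = predict_item_item(ratings, "alice", "D", k=2)
print(round(prediction, 2))
```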

Collaborative Filtering: Neighbourhood: Combining User-User and Item-Item Similarity

Since any missing (unobserved) entry \(r_{ij}\) in the user/item ratings matrix can be estimated using either User-User Similarity (\(\hat{r}_{ij}^{(user)}\)) or Item-Item Similarity (\(\hat{r}_{ij}^{(item)}\)), a given missing entry can also be estimated as a weighted average of the two:

\[\hat{r}_{ij} \quad=\quad \alpha \space \hat{r}_{ij}^{(user)} + (1-\alpha) \space \hat{r}_{ij}^{(item)} \quad,\quad\alpha\in[0,1]\]

where the hyperparameter \(\alpha\) can be used to control the balance between recommendation relevance and recommendation serendipity.
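A worked example of the weighted average, using made-up values for the two component predictions (the function name `blend_predictions` and the numbers are assumptions for illustration):

```python
def blend_predictions(r_user, r_item, alpha):
    """alpha * user-based estimate + (1 - alpha) * item-based estimate."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * r_user + (1.0 - alpha) * r_item

# Hypothetical estimates of the same missing rating from the two methods.
r_user_based, r_item_based = 2.0, 4.0

for alpha in (0.0, 0.25, 0.5, 1.0):
    print(alpha, blend_predictions(r_user_based, r_item_based, alpha))
```

Setting \(\alpha=0\) recovers the pure item-item estimate and \(\alpha=1\) the pure user-user estimate; intermediate values could be chosen by comparing prediction error on held-out ratings.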

Collaborative Filtering: Matrix Factorization

Matrix factorization refers to the process of learning a low-rank approximation of the user/item ratings matrix, then using this low-rank representation to infer the missing (unobserved) ratings in the matrix.

image source: author

For more information, refer to Matrix Factorization (Latent Factor Models)

Collaborative Filtering: Neighbourhood: Graph-Based

When using a neighbourhood-based collaborative filtering approach, sparsity of the ratings matrix can sometimes make it impossible to obtain a satisfactory set of similar users (items) for some of the users (items). This problem is elegantly solved by representing the relationships between users and/or items using a graph, since it allows one to measure the similarity between users (items) via intermediate users (items) e.g. users are considered more similar if they have a shorter path between them in the graph (even if they have no items in common).

Image Source: author

For more information, refer to Graph-Based Collaborative Filtering

Collaborative Filtering: Naïve Bayes

The unobserved entries in the user/item matrix can be estimated by modelling the item-rating process as a generative probabilistic process (i.e. the probability of a particular user giving a particular rating to a particular item is governed by a probability distribution).

\[\begin{array}{lcl}p\Big(r_{uj}=v_s\Bigl|\text{observed ratings in } I_u\Big) &=& \displaystyle\frac{p\Big(\text{observed ratings in } I_u\Bigl|r_{uj}=v_s\Big) \times p\Big(r_{uj}=v_s\Big)}{ p\Big(\text{observed ratings in } I_u\Big)} \\ &\propto& p\Big(\text{observed ratings in } I_u\Bigl|r_{uj}=v_s\Big) \times p\Big(r_{uj}=v_s\Big) \\ r_{uj} &=& \text{rating of item } j \text{ by user } u \\ v_s &=& \text{item rating (one of } l \text{ discrete ratings } \{v_1,v_2,...,v_l\}\text{)}\\ I_u &=& \text{set of items already rated by user } u \\ \end{array}\]

Each of the quantities in this formula can be estimated directly from the data in the observed user/item ratings matrix.

For more information, refer to Naïve Bayes Collaborative Filtering










Content-Based Recommendation


Example User Item Consumption History
user_ID | movie | user_rating | length_hours | origin_country | budget_millions | genre | description | thrilling | exploding | adventure | exotic | drama | love
46290 | movie A | 5 | 1.5 | USA | 50 | action | a thrilling exploding adventure | 1 | 1 | 1 | 0 | 0 | 0
46290 | movie B | 3 | 2.0 | India | 60 | drama | an exotic adventure full of drama | 0 | 0 | 1 | 1 | 1 | 0
46290 | movie C | 4 | 1.5 | UK | 100 | action | things exploding everywhere | 0 | 1 | 0 | 0 | 0 | 0
46290 | movie D | 1 | 3.0 | USA | 4 | romance | an american love story drama | 0 | 0 | 0 | 0 | 1 | 1

Content-based recommendation systems generate recommendations using item attributes data. For example, over time one could build an item attribute profile for a particular user (e.g. “user likes Italian and Indian food, but never orders red meat or chilli”), and use this profile to tailor their future item recommendations.

An example of a content-based model is:

\[\begin{array}{lcl} \hat{r}_{ij} &=& \text{predicted rating/affinity for item } j \text{ by user } i \\ &=& f\Big(\overset{\rightarrow{}}{\mathbf{v}}_j, \mathcal{D}_L^{(i)}\Big) \\ &=& \text{some function of the attributes of item } j \text{ and the attributes of the items that user } i \text{ has previously consumed} \\ \overset{\rightarrow{}}{\mathbf{v}}_j &=& \text{vector of attributes of item } j \\ \mathcal{D}_L^{(i)} &=& \text{the set of items previously consumed by user } i \text{ (the item attribute vectors)} \\ f() &=& \text{any chosen function (e.g. nearest neighbour model)} \\ \end{array}\]

Some examples of function \(f()\) are:

  1. The cosine distance between \(\overset{\rightarrow{}}{\mathbf{v}}_j\) and an aggregation of the vectors in \(\mathcal{D}_L^{(i)}\) (mean/max/sum etc.)

  2. A supervised learning model using item attributes \(\overset{\rightarrow{}}{\mathbf{v}}_j \space, \mathcal{D}_L^{(i)}\) as features. Note that a content-based system builds a separate model for each individual user, which is notably different to the global models trained over all users described in this section.

    Example: training a regression model to predict movie ratings for user 46290 using the data in the table “Example User Item Consumption History” above

    In this particular context, it is much more important that the chosen model is robust to overfitting (e.g. elastic net, naive bayes), since in this case the data is likely to be wide (many features) and short (few examples).

  3. Association Rules-Based Classifiers of the form

    \(\{\text{item contains feature set A}\}=>\{\text{rating="like"}\}\)

    Example: \(\{\text{item_material="leather", item_colour="red"}\}=>\{\text{rating=dislike}\}\)

    Note, again, that these rules are learned separately for each user (i.e. a particular rule applies only to 1 user)

    Refer also to Association Rules-Based Recommendation, which describes the learning of rules which apply globally (i.e. to all users).

  4. A neighbourhood-based model: the predicted rating for a new item (attribute vector \(\overset{\rightarrow{}}{\mathbf{v}}_j\)) is calculated as the aggregated rating (vote) over the closest \(k\) items to \(\overset{\rightarrow{}}{\mathbf{v}}_j\) in \(\mathcal{D}_L^{(i)}\) (using cosine similarity, euclidean distance, manhattan distance etc.)
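As a minimal sketch of the neighbourhood-based option (number 4 above), the following uses the keyword columns of the “Example User Item Consumption History” table as item attribute vectors for user 46290. The new item (“a thrilling adventure”), the function names, and the choice of a similarity-weighted mean as the aggregation are assumptions made for this illustration.

```python
import math

FEATURES = ["thrilling", "exploding", "adventure", "exotic", "drama", "love"]

# (attribute vector, rating) for each item that user 46290 has consumed.
history = {
    "movie A": ([1, 1, 1, 0, 0, 0], 5),
    "movie B": ([0, 0, 1, 1, 1, 0], 3),
    "movie C": ([0, 1, 0, 0, 0, 0], 4),
    "movie D": ([0, 0, 0, 0, 1, 1], 1),
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def predict_rating(item_vector, history, k=2):
    """Similarity-weighted mean rating over the k consumed items closest to the new item."""
    scored = sorted(
        ((cosine_similarity(item_vector, vec), rating) for vec, rating in history.values()),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in scored)
    if total == 0:
        return None  # the new item shares no attributes with the user's history
    return sum(sim * rating for sim, rating in scored) / total

# Hypothetical unseen item described as "a thrilling adventure".
new_item = [1, 0, 1, 0, 0, 0]
predicted = predict_rating(new_item, history, k=2)
print(round(predicted, 2))
```

Because the model only ever looks at this one user's history, an item with no attributes in common with anything they have consumed gets no prediction at all, which illustrates the limited scope of purely content-based recommendations.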

A content-based strategy is particularly effective in situations in which:

  1. There is rich and predictive item attribute data available (structured or raw text)

  2. Past user item consumption is predictive of their future preferences

Compared to collaborative filtering:

  1. content-based recommendation can robustly recommend items with few or no ratings (i.e. it alleviates the cold start problem for new items)

  2. content-based recommendation tends to produce more relevant (if also more boring and obvious) recommendations. This is because it cannot recommend items outside of the user’s historical item attribute consumption (i.e. it won’t recommend items outside of the scope of what they have consumed in the past)

If users are able to explicitly specify their own preferences (like a knowledge-based recommendation system), then this information can be incorporated into their item attribute preference profile and affect their recommendations.

Content-Based Recommendation: Raw Text Preprocessing


Example Item Text Data (terms with Inverse Document Frequency)
item_ID | raw_text_description | red_idf | blue_idf | bikini_idf | shoe_idf | pants_idf | strap_idf | formal_idf | sport_idf | suede_idf | cotton_idf
111 | Strappy Red Bikini | 0.5 | 0.0000000 | 1 | 0.0 | 0 | 0.5 | 0 | 0 | 0 | 0
112 | Blue Suede Shoes | 0.0 | 0.3333333 | 0 | 0.5 | 0 | 0.0 | 0 | 0 | 1 | 0
113 | Formal Cotton Pants (Blue) | 0.0 | 0.3333333 | 0 | 0.0 | 1 | 0.0 | 1 | 0 | 0 | 1
114 | red sport shoe (football) with blue strap | 0.5 | 0.3333333 | 0 | 0.5 | 0 | 0.5 | 0 | 1 | 0 | 0

In many recommendation contexts, there is typically a lot of raw text item description information available. Raw text can be extremely predictive, but it often requires a lot of data preprocessing steps in order to be used by a predictive model.

Item text is most simply coded in a (cleaned) bag of words representation, in which the component words are recorded in a 1-hot encoding format i.e. the order of the words is ignored, and only their occurrence (or their frequency of occurrence) is recorded. It is possible to include the information in the word order (i.e. proper language understanding) using a more complex model (such as a Recurrent Neural Network or Transformer) but this added complexity (and resource requirement) would need to be justified by an increase in model performance.
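A minimal sketch of the bag of words representation, using the four item descriptions from the table above. The tokeniser (lowercase, letters only), the variable names, and the use of \(\log(N/\text{df})\) as the inverse document frequency are assumptions for this illustration; the weighting differs from the values shown in the table, and no stemming is applied, so “shoes” and “shoe” remain separate terms here.

```python
import math
import re
from collections import Counter

descriptions = {
    111: "Strappy Red Bikini",
    112: "Blue Suede Shoes",
    113: "Formal Cotton Pants (Blue)",
    114: "red sport shoe (football) with blue strap",
}

def tokenize(text):
    """Lowercase the text and keep runs of letters; word order is discarded later."""
    return re.findall(r"[a-z]+", text.lower())

# Bag of words: term -> number of occurrences, per item.
bags = {item: Counter(tokenize(text)) for item, text in descriptions.items()}

# Document frequency: number of items whose description contains each term.
document_frequency = Counter()
for bag in bags.values():
    document_frequency.update(bag.keys())

n_docs = len(bags)
idf = {term: math.log(n_docs / df) for term, df in document_frequency.items()}

# TF-IDF weight of each term in each item description.
tfidf = {
    item: {term: count * idf[term] for term, count in bag.items()}
    for item, bag in bags.items()
}
```

Terms that appear in many item descriptions (such as “blue”) receive a lower weight than terms that identify a single item (such as “bikini”).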

Here is a description of some common raw text preprocessing (cleaning) tasks:










Creating User & Item Characterization Vectors


TODO

refer also to Raw Text Preprocessing










Supervised Learning


The problem of predicting a particular user’s affinity for a particular item can be framed as a supervised learning problem, allowing us to use any one of the available plethora of performant classification, regression or ranking models (generalized linear models, gradient boosting, random forest, deep learning etc.).

The user rating (affinity) for a particular item is modelled as a function of the user’s attributes, the item’s attributes and (possibly) information on the recommendation context:

\[\begin{array}{lcl} r_{ijk} &=& f\Big(\mathbf{x}^{(user)}_i, \mathbf{x}^{(item)}_j, \mathbf{x}^{(context)}_k\Big) \\ \mathbf{x}^{(user)}_i &=& \text{vector of user attributes} \\ \mathbf{x}^{(item)}_j &=& \text{vector of item attributes} \\ \mathbf{x}^{(context)}_k &=& \text{vector of recommendation context information} \\ \end{array}\]

A strength of this method is that it can mitigate the recommendation cold start problem (in which recommendations cannot be generated for new users and/or items) by sharing information across users and items.

Note that user/item interaction data (à la Collaborative Filtering) can be included in a supervised learning model by including a user’s row in the ratings matrix (or some aggregation of it) in the user’s feature vector \(\mathbf{x}^{(user)}_i\), or by including an item’s column in the ratings matrix (or some aggregation of it) in the item’s feature vector \(\mathbf{x}^{(item)}_j\).
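To illustrate the general shape of \(f\), here is a minimal sketch in which the user, item and context vectors are concatenated into a single feature row and fed to a logistic regression trained by stochastic gradient descent. The binary attributes (`is_young`, `is_action`, `is_weekend`), the toy labels and all function names are hypothetical; any of the model families mentioned above could take the place of the logistic regression.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def features(user, item, context):
    """Concatenate user, item and context attribute vectors into one feature row."""
    return list(user) + list(item) + list(context)

def train_logistic_regression(rows, labels, lr=0.5, epochs=200):
    """Plain stochastic gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            error = y - p
            bias += lr * error
            weights = [w + lr * error * v for w, v in zip(weights, x)]
    return weights, bias

def predict(weights, bias, x):
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

# Hypothetical attributes: user=[is_young], item=[is_action], context=[is_weekend].
rows, labels = [], []
for is_young in (0, 1):
    for is_action in (0, 1):
        for is_weekend in (0, 1):
            rows.append(features([is_young], [is_action], [is_weekend]))
            labels.append(is_action)  # toy ground truth: action items get clicked

weights, bias = train_logistic_regression(rows, labels)
p_action = predict(weights, bias, features([1], [1], [0]))
p_other = predict(weights, bias, features([1], [0], [0]))
```

Because the model is fitted on attributes rather than on user or item identities, it can score a user/item pair that has never been observed, as long as the attributes are known.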

Some specific examples of recommender architectures based on supervised learning are:

  1. Factorization Machines

  2. The 2 Tower model

  3. The Wide & Deep model

  4. The Deep & Cross model

Another use of supervised learning in generating item recommendations is to build a separate model for every individual user using item attributes as model features: refer to Content-Based Recommendation.










Multi-Armed Bandits


Multi-armed bandit algorithms are a class of reinforcement learning models designed to maximise some objective in an environment in which one must repeatedly choose one of a discrete set of available options (observing a noisy reward signal after each decision).

image source: author

The classic analogy for this is a gambling machine with multiple levers, each lever having an unknown reward distribution: the player must balance exploration (pulling levers not pulled much before) with exploitation (pulling levers historically observed to generate high reward) in order to maximise total reward over multiple lever pulls (iterations).

Recommendation systems often favour items for which there is a lot of observed user interaction data, and multi-armed bandit algorithms aid the system in exploring the unexplored items in the catalogue in a principled way. They do this by directly modelling uncertainty and exploration.

Multi-armed bandit algorithms are also important in a recommendation context in which the system must make decisions online i.e. a user reacts to a provided set of item recommendations and there is insufficient time to retrain a model incorporating this information before the user requires a new set of item recommendations (recommending them the same items again is a bad user experience).

Example: item recommendations are available from 5 different recommendation models (all trained offline) and we would like to investigate which of the 5 (or what weighting of them) is most appropriate for a particular user. We would like to respond to user feedback (e.g. click/no click on a recommended product) in real time i.e. that user’s feedback will affect what they are recommended next, without us having to wait for the models to retrain.

Some examples of popular multi-armed bandit algorithms are:

  1. Thompson Sampling

  2. Upper Confidence Bound (UCB)

  3. \(\epsilon\)-Greedy (and variants)
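As a minimal sketch of the first of these, here is Thompson Sampling for binary rewards (click / no click) with a Beta posterior per arm. The arms could be, for example, the competing offline-trained recommendation models from the example above. The simulated click rates, the uniform Beta(1, 1) prior and the function name are assumptions made for this illustration.

```python
import random

def thompson_sampling(true_click_rates, n_rounds=5000, seed=0):
    """Simulate Beta-Bernoulli Thompson Sampling; returns the pull count per arm."""
    rng = random.Random(seed)
    n_arms = len(true_click_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    for _ in range(n_rounds):
        # Draw one plausible click rate per arm from its current posterior.
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Play the arm whose sampled click rate is highest.
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Observe a noisy reward and update that arm's posterior immediately.
        reward = 1 if rng.random() < true_click_rates[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
    return [s + f for s, f in zip(successes, failures)]

# Hypothetical true click rates, unknown to the algorithm.
pulls = thompson_sampling([0.1, 0.5, 0.9])
print(pulls)
```

Arms with few observations have wide posteriors and so are still sampled occasionally (exploration), while arms with consistently high observed reward are chosen most of the time (exploitation). The update after each round requires no model retraining.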

An extension, contextual multi-armed bandit algorithms, explicitly incorporates context into the arm-selection decision. Some examples of these are:

  1. linUCB (L. Li et al. 2010)

  2. Greedy Linear Bandit & Greedy-First Linear Bandit (Bastani, Bayati, and Khosravi 2017)

  3. RegCB (Foster et al. 2018)










Vincent’s Lemma: Serendipity


This is a simple and heuristic method (created by Vincent Warmerdam) that was found to substantially increase the coverage of item recommendations in a commercial movie recommendation context.

It is a system for next item recommendation.

It shares some conceptual similarities with Association Rules-Based Recommendation

\[\begin{array}{lcl} R_{j\rightarrow{}i} &=& \text{relative affinity for item } i \text{ among item } j \text{ consumers compared to the rest of the user population} \\ &=& \displaystyle\frac{\text{% of item } j \text{ consumers who have also consumed item }i}{\text{% of non-item } j \text{ consumers who have also consumed item }i} \\ &=& \displaystyle\frac{p(s_i\bigl|s_j)}{p(s_i\bigl|s_j^c)} \\ &\approx& \displaystyle\frac{p(s_i\bigl|s_j)}{p(s_i)} \\ \end{array}\]

For users who have consumed item \(j\), the item \(i\) with the highest score \(R_{j\rightarrow{}i}\) should be recommended to them.

For items with a low amount of user interaction, the \(R_{j\rightarrow{}i}\) score will be unstable. This could be alleviated by making the calculation Bayesian, and shrinking the metric using a chosen prior distribution.
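A minimal sketch of the score calculation, using the exact ratio \(p(s_i|s_j)/p(s_i|s_j^c)\) from the formula above. The toy consumption histories and the function names are assumptions for illustration, and no Bayesian shrinkage is applied: undefined ratios are simply skipped.

```python
def relative_affinity(histories, item_j, item_i):
    """R_{j->i} = p(s_i | s_j) / p(s_i | not s_j); None if the ratio is undefined."""
    with_j = [h for h in histories.values() if item_j in h]
    without_j = [h for h in histories.values() if item_j not in h]
    if not with_j or not without_j:
        return None
    p_i_given_j = sum(item_i in h for h in with_j) / len(with_j)
    p_i_given_not_j = sum(item_i in h for h in without_j) / len(without_j)
    if p_i_given_not_j == 0:
        return None  # a prior (shrinkage) would avoid this division by zero
    return p_i_given_j / p_i_given_not_j

def recommend_next(histories, item_j):
    """Item with the highest R_{j->i} among all items other than item_j."""
    candidates = set().union(*histories.values()) - {item_j}
    scores = {i: relative_affinity(histories, item_j, i) for i in candidates}
    scores = {i: s for i, s in scores.items() if s is not None}
    return max(scores, key=scores.get) if scores else None

# Hypothetical histories: "popular" is widely consumed, "niche" mostly by "J" consumers.
histories = {
    "u1": {"J", "niche"},
    "u2": {"J", "niche"},
    "u3": {"J", "popular"},
    "u4": {"popular", "niche"},
    "u5": {"popular"},
    "u6": {"popular"},
}

score_niche = relative_affinity(histories, "J", "niche")      # (2/3) / (1/3)
score_popular = relative_affinity(histories, "J", "popular")  # (1/3) / (3/3)
best = recommend_next(histories, "J")
```

In this toy data the globally popular item scores poorly because it is consumed just as readily by users who never touched item J, which gives some intuition for why a score of this kind can spread recommendations beyond the most popular items.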










Association Rules-Based Recommendation


Association Rules are rules of the form

\[\overset{\text{antecedent set}}{\{...\}} \quad\overset{\text{implies}}{=>} \quad \overset{\text{consequent set}}{\{...\}}\]

Very efficient algorithms such as the Apriori algorithm have been devised for searching data for rules of this form.

Association Rules are particularly effective in the case of unary ratings.

Here are some of the ways that association rules can be used to generate item recommendations:

  1. Item-wise Recommendations:

    • \(\overset{\text{item set}}{\{...\}} \quad\overset{\text{implies}}{=>} \quad \overset{\text{item set}}{\{...\}}\)

    • Example: \(\{\text{bread, tomato}\}=>\{\text{cheese}\}\quad\) i.e. users who have bought bread and tomato should be recommended cheese.

  2. User-wise Recommendations

    • \(\overset{\text{user set}}{\{...\}} \quad\overset{\text{implies}}{=>} \quad \overset{\text{user set}}{\{...\}}\)

    • Example: \(\{\text{alice, bob}\}=>\{\text{john}\}\quad\) i.e. if users “Alice” and “Bob” have bought an item then “John” is also likely to like it

  3. Profile Association Rules:

    • \(\overset{\text{user attribute set}}{\{...\}} \quad\overset{\text{implies}}{=>} \quad \overset{\text{item set}}{\{...\}}\)

    • Example: \(\{\text{male, age30-39, 2_children}\}=>\{\text{home loan}\}\quad\) i.e. a large proportion of male users in their 30s with 2 children have consumed the item “home loan”, making it a promising recommendation for this user segment
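As a minimal sketch of the item-wise case, the following finds rules by brute-force enumeration of small antecedent sets and scores them by support and confidence. This is not the Apriori algorithm (which prunes the search far more efficiently), and the toy baskets, thresholds and function names are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical purchase baskets (unary data: an item is either present or absent).
baskets = [
    {"bread", "tomato", "cheese"},
    {"bread", "tomato", "cheese"},
    {"bread", "tomato"},
    {"bread", "milk"},
    {"milk", "cheese"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def mine_rules(baskets, min_support=0.4, min_confidence=0.6, max_antecedent_size=2):
    """Enumerate rules {antecedent} => {single item} meeting both thresholds."""
    items = sorted(set().union(*baskets))
    rules = []
    for size in range(1, max_antecedent_size + 1):
        for antecedent in combinations(items, size):
            antecedent = frozenset(antecedent)
            antecedent_support = support(antecedent, baskets)
            if antecedent_support < min_support:
                continue
            for consequent in items:
                if consequent in antecedent:
                    continue
                rule_support = support(antecedent | {consequent}, baskets)
                confidence = rule_support / antecedent_support
                if rule_support >= min_support and confidence >= min_confidence:
                    rules.append((antecedent, consequent, rule_support, confidence))
    return rules

def recommend(basket, rules):
    """Consequents of rules whose antecedent is in the basket, highest confidence first."""
    best = {}
    for antecedent, consequent, _, confidence in rules:
        if antecedent <= basket and consequent not in basket:
            best[consequent] = max(confidence, best.get(consequent, 0.0))
    return sorted(best, key=best.get, reverse=True)

rules = mine_rules(baskets)
suggestions = recommend({"bread", "tomato"}, rules)
print(suggestions)
```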

Here is python code showing how global association rules can be used to generate recommendations










Sequential Pattern Mining


TODO










Clustering-Based Recommendation


TODO










Graph-Based Collaborative Filtering


When using a neighbourhood-based collaborative filtering approach, sparsity of the ratings matrix can sometimes make it impossible to obtain a satisfactory set of similar users (items) for some of the users (items). This problem is elegantly solved by representing the relationships between users and/or items using a graph, since it allows one to measure the similarity between users (items) via intermediate users (items) e.g. users are considered more similar if they have a shorter path between them in the graph (even if they have no items in common). Relationships between users and items can be cleanly described (and modelled) using graphs, in which nodes represent entities (e.g. user or item) and edges represent relationships between them.

Graphs define a novel measure of distance (dissimilarity) between entities: the length of a path between them, travelling along edges (e.g. using the shortest path, or using a random walk).
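A minimal sketch of the shortest-path idea, using a plain breadth-first search over a bipartite user/item graph rather than the NetworkX and graph-walker libraries used in this document's own implementation. The toy consumption data and function names are assumptions for illustration.

```python
from collections import deque

# Hypothetical consumption data: user -> set of items consumed.
consumed = {
    "alice": {"A"},
    "bob": {"A", "B"},
    "carol": {"B", "C"},
    "dave": {"D"},
}

def build_graph(consumed):
    """Undirected bipartite graph: each user node is linked to the items they consumed."""
    graph = {}
    for user, items in consumed.items():
        for item in items:
            graph.setdefault(("user", user), set()).add(("item", item))
            graph.setdefault(("item", item), set()).add(("user", user))
    return graph

def shortest_path_length(graph, source, target):
    """Number of edges on the shortest path (breadth-first search); None if unconnected."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

graph = build_graph(consumed)
d_alice_bob = shortest_path_length(graph, ("user", "alice"), ("user", "bob"))
d_alice_carol = shortest_path_length(graph, ("user", "alice"), ("user", "carol"))
d_alice_dave = shortest_path_length(graph, ("user", "alice"), ("user", "dave"))
```

Alice and Carol share no items, so a row-based similarity gives them nothing to compare, yet the graph still places them a finite distance apart via Bob.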

image source: author

Here is python code implementing graph-based collaborative filtering using NetworkX and graph-walker

image source: author

See also: Graph Neural Networks










Matrix Factorization (Latent Factor Models)


Matrix factorization refers to the process of learning a low-rank approximation of the user/item ratings matrix then using this low-rank representation to infer the missing (unobserved) ratings in the matrix.

image source: author

There are many different ways in which this matrix factorization can be performed, each with its own strengths and weaknesses. These variants are defined by the constraints which they impose on the latent factors. Imposing constraints on the latent factors cannot reduce (and will generally increase) the error on the observed (training) data, but these constraints can also improve model generalization (error on unobserved data) and increase model interpretability.
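As a minimal sketch of the unconstrained variant with L2 regularisation, the following fits \(R \approx UV^T\) to the observed entries only, using stochastic gradient descent in plain Python. It is not the TensorFlow or PyTorch implementation referred to in this document, and the toy ratings, hyperparameter values and function names are assumptions for illustration.

```python
import math
import random

def factorize(observed, n_users, n_items, rank=2, lr=0.02, reg=0.02,
              epochs=1000, seed=0):
    """Learn user factors U and item factors V from (user, item, rating) triples."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.5, 0.5) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(-0.5, 0.5) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for i, j, r in observed:
            err = r - sum(U[i][f] * V[j][f] for f in range(rank))
            for f in range(rank):
                u_if, v_jf = U[i][f], V[j][f]
                # Gradient step on squared error plus an L2 penalty on the factors.
                U[i][f] += lr * (err * v_jf - reg * u_if)
                V[j][f] += lr * (err * u_if - reg * v_jf)
    return U, V

def predict(U, V, i, j):
    return sum(u * v for u, v in zip(U[i], V[j]))

def rmse(U, V, observed):
    return math.sqrt(
        sum((r - predict(U, V, i, j)) ** 2 for i, j, r in observed) / len(observed)
    )

# Toy 4x4 ratings matrix with two unobserved entries: (0, 3) and (2, 2).
observed = [
    (0, 0, 5), (0, 1, 3), (0, 2, 4),
    (1, 0, 5), (1, 1, 3), (1, 2, 4), (1, 3, 1),
    (2, 0, 1), (2, 1, 5), (2, 3, 5),
    (3, 0, 4), (3, 1, 2), (3, 2, 5), (3, 3, 2),
]

U, V = factorize(observed, n_users=4, n_items=4)
train_error = rmse(U, V, observed)
missing_rating = predict(U, V, 0, 3)  # inferred rating for an unobserved entry
```

The rows of `U` and `V` are the user and item embeddings; the error on held-out ratings, rather than the training error computed here, is what should guide the choice of rank and regularisation strength.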

Here is a summary of factorization methods from (Aggarwal 2016) -

The Family of Matrix Factorization Methods
Method | Constraints on factor matrices | Advantages/disadvantages
Unconstrained | none | Highest quality solution; Good for most matrices; Regularisation prevents overfitting; Poor interpretability
Singular Value Decomposition (SVD) | orthogonal basis | Good visual interpretability; Out-of-sample recommendations; Good for dense matrices; Poor semantic interpretability; Suboptimal in sparse matrices
Maximum Margin | none | Highest quality solution; Resists overfitting; Similar to unconstrained; Poor interpretability; Good for discrete ratings
Non-Negative Matrix Factorization (NMF) | non-negativity | Good quality solution; High semantic interpretability; Loses interpretability with both like/dislike ratings; Less overfitting in some cases; Best for implicit feedback
Probabilistic Latent Semantic Analysis (PLSA) | non-negativity | Good quality solution; High semantic interpretability; Probabilistic interpretation; Loses interpretability with both like/dislike ratings; Less overfitting in some cases; Best for implicit feedback

Matrix factorization models can also be combined with other recommender models within a single model architecture. Refer to:

  1. Integrating Latent Factor Models with Arbitrary Models

  2. Hybrid Systems

Latent factor models can also explicitly model recommendation context (refer to Contextual Latent Factor Models)

As with all collaborative filtering, latent factor models suffer from the cold start problem (they struggle to generate recommendations for new users and new items). This problem is somewhat alleviated by the two tower model, which is a natural extension incorporating user and item features into the model.

Here are python code implementations of matrix factorization:

Performing matrix factorization in an automatic differentiation framework such as TensorFlow or PyTorch is not the fastest way to perform matrix factorization, but it provides a general model framework which is easy to extend e.g. to incorporate additional data sources such as item attributes or recommendation context (e.g. see Two Tower Models and Integrating Latent Factor Models with Arbitrary Models).










Naïve Bayes Collaborative Filtering

[back to contents]

Another way to estimate unobserved entries in the user/item matrix is to model the rating of items as a generative probabilistic process (i.e. the probability of a particular user giving a particular rating to a particular item is governed by a probability distribution).

\[\begin{array}{lcl}p\Big(r_{uj}=v_s\Bigl|\text{observed ratings in } I_u\Big) &=& \displaystyle\frac{p\Big(\text{observed ratings in } I_u\Bigl|r_{uj}=v_s\Big) \times p\Big(r_{uj}=v_s\Big)}{ p\Big(\text{observed ratings in } I_u\Big)} \\ &\propto& p\Big(\text{observed ratings in } I_u\Bigl|r_{uj}=v_s\Big) \times p\Big(r_{uj}=v_s\Big) \\ r_{uj} &=& \text{rating of item } j \text{ by user } u \\ v_s &=& \text{item rating (one of } l \text{ discrete ratings } \{v_1,v_2,...,v_l\}\text{)}\\ I_u &=& \text{set of items already rated by user } u \\ \end{array}\]

Each of the quantities in this formula can be estimated directly from the data in the observed user/item ratings matrix.

Note that the denominator \(p\Big(\text{observed ratings in } I_u\Big)\) (if it is needed) need not be estimated directly - it can be calculated using the basic probability law \(\displaystyle\sum_{s=1}^l p\Big(r_{uj}=v_s\Bigl|\text{observed ratings in } I_u\Big) = 1\).

\(p\Big(r_{uj}=v_s\Big)\) (probability of user \(u\) giving rating \(v_s\) to item \(j\) prior to observing any data on user \(u\)) can be estimated using \(\hat{p}\Big(r_{uj}=v_s\Big)=\displaystyle\frac{\text{number of users who have given rating } v_s \text{ to item } j}{\text{number of users who have rated item } j }\).

\(p\Big(\text{observed ratings in } I_u\Bigl|r_{uj}=v_s\Big)\) (the relative likelihood of the observed ratings for the items \(I_u\) given a specific rating \(v_s\) for item \(j\)) can be simply estimated (using a VERY strong independence assumption) using:

\[\begin{array}{lcl}\hat{p}\Big(\text{observed ratings in } I_u\Bigl|r_{uj}=v_s\Big) &=& \displaystyle\prod_{k\in I_u}p\Big(r_{uk}\Bigl|r_{uj}=v_s\Big) \\ \hat{p}\Big(r_{uk}\Bigl|r_{uj}=v_s\Big) &=& \displaystyle\frac{\text{number of users who have given rating } v_s \text{ to item } j \text{ and rating } r_{uk} \text{ to item } k }{\text{number of users who have given rating } v_s \text{ to item } j} \end{array}\]

This assumption of independence between the probabilities of user \(u\)’s ratings for the items in \(I_u\) (after observing their rating of item \(j\)) is extremely unrealistic, but can still result in a performant model in practice.
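The estimation steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation. The function name `naive_bayes_posterior`, the `alpha` Laplace smoothing term (added here to avoid zero probabilities) and the toy matrix are assumptions of this sketch. The conditional probabilities are also estimated only over users who rated both item \(j\) and item \(k\), which is a small deviation from the denominator written above.

```python
import numpy as np

def naive_bayes_posterior(R, u, j, rating_values, alpha=1.0):
    """Posterior p(r_uj = v_s | observed ratings in I_u) for every v_s.

    R             : (users x items) float array, np.nan marks a missing rating
    rating_values : the l discrete ratings {v_1, ..., v_l}
    alpha         : Laplace smoothing strength (assumption of this sketch)
    """
    l = len(rating_values)
    observed = ~np.isnan(R)
    # I_u: items already rated by user u (excluding the target item j)
    I_u = [k for k in range(R.shape[1]) if k != j and observed[u, k]]

    log_post = np.zeros(l)
    for s, v in enumerate(rating_values):
        gave_v = R[:, j] == v  # users who gave rating v_s to item j
        # prior p(r_uj = v_s)
        log_post[s] = np.log((gave_v.sum() + alpha) / (observed[:, j].sum() + alpha * l))
        # likelihood: product over k in I_u of p(r_uk | r_uj = v_s)
        for k in I_u:
            both = gave_v & observed[:, k]
            match = gave_v & (R[:, k] == R[u, k])
            log_post[s] += np.log((match.sum() + alpha) / (both.sum() + alpha * l))

    # normalise so that the posterior sums to 1 over the l rating values
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

nan = np.nan
R = np.array([
    [1, 1, nan],   # user 0: rating of item 2 is unknown
    [1, 1, 1],
    [1, 1, 1],
    [5, 5, 5],
    [5, 5, 5],
], dtype=float)
posterior = naive_bayes_posterior(R, u=0, j=2, rating_values=(1, 5))
predicted_rating = (1, 5)[int(np.argmax(posterior))]
```

Working in log space avoids numerical underflow when \(I_u\) is large. The final normalisation step is the "basic probability law" mentioned above, so the denominator never has to be estimated directly.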










Knowledge-Based Recommendation

[back to contents]

Knowledge-based recommender systems generate recommendations by combining an explicit user query with domain knowledge/algorithmic intelligence (e.g. user attributes, item attributes, historic item consumption data, recommendation context, item availability etc.). In this way, they lie somewhere on the spectrum between a pure queryless recommender system and a search engine.

Knowledge-based recommendation is typically facilitated through an iterative/cyclic exchange between the user and an interactive user interface, for example:

\[\text{user query} \rightarrow \text{query result} \rightarrow \text{refined user query} \rightarrow \text{refined query result} \rightarrow \text{refined user query} \rightarrow \text{refined query result} \rightarrow ...\text{etc. (until user finds a satisfactory item)}\]

Knowledge-based recommender systems are well-suited to recommendation contexts in which each item is unique, and not sold often (e.g. houses and cars).

Knowledge-Based Recommendation: Constraint-Based

The user provides a set of constraints, and item recommendations are then drawn from the subset of items matching those constraints (the items within the subset can be further ranked by a relevance model). Example constraints for a house search:

  • “homes in East London”

  • “price < $100”

  • “bathrooms > 2”

After receiving the query result, the user can refine their constraint set (make the search more liberal or more conservative) and rerun the query.
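As a rough illustration of the constraint/refine cycle, here is a minimal sketch in plain Python. The `House` class, the example listings, the price threshold and the `constraint_based_recommend` helper are all hypothetical and exist only for this example. The relevance model is reduced to a simple "cheapest first" sort.

```python
from dataclasses import dataclass

@dataclass
class House:
    name: str
    area: str
    price: float
    bathrooms: int

def constraint_based_recommend(items, constraints, relevance=None):
    """Keep the items satisfying every constraint, optionally ranked by relevance."""
    matches = [item for item in items if all(check(item) for check in constraints)]
    if relevance is not None:
        matches.sort(key=relevance, reverse=True)
    return matches

houses = [
    House("A", "East London", 450_000, 3),
    House("B", "East London", 900_000, 4),
    House("C", "West London", 400_000, 3),
    House("D", "East London", 300_000, 1),
]

# initial user query
constraints = [
    lambda h: h.area == "East London",
    lambda h: h.price < 500_000,
    lambda h: h.bathrooms > 2,
]
cheapest_first = lambda h: -h.price
first_result = constraint_based_recommend(houses, constraints, cheapest_first)

# refined (more liberal) query: the user relaxes the bathroom constraint
constraints[2] = lambda h: h.bathrooms >= 1
refined_result = constraint_based_recommend(houses, constraints, cheapest_first)
```

Representing each constraint as a predicate keeps the refinement step trivial: the user interface only has to add, remove or swap entries in the constraint list and rerun the query.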

Knowledge-Based Recommendation: Case-Based

The user provides a case/target/anchor item, and the recommender system then finds other items similar to it, possibly along user-specified item dimensions. Example: “please find me songs similar to ‘Come Together’ by the Beatles”.
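A case-based lookup can be sketched as a nearest-neighbour search over item attribute vectors. The sketch below assumes that each item is described by a numeric feature vector and that cosine similarity is an acceptable similarity measure. Both are choices made for this illustration. The function name `case_based_recommend` and its `dims` argument (the user-specified item dimensions) are hypothetical.

```python
import numpy as np

def case_based_recommend(item_features, anchor, top_n=3, dims=None):
    """Return the indices of the top_n items most similar to the anchor item.

    item_features : (n_items x n_features) array of item attributes
    anchor        : index of the case/target item supplied by the user
    dims          : optional list of feature columns to compare on
    """
    X = np.asarray(item_features, dtype=float)
    if dims is not None:
        X = X[:, dims]
    # cosine similarity: normalise each row, then take dot products
    norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    X = X / norms
    sims = X @ X[anchor]
    order = np.argsort(-sims)
    return [int(i) for i in order if i != anchor][:top_n]

# toy example: 2 attributes per song (e.g. "rock-ness", "tempo")
songs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
similar_to_first = case_based_recommend(songs, anchor=0, top_n=2)
```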










Hybrid Systems

[back to contents]

A hybrid system is one that integrates multiple different models into a single combined architecture.

Here are some examples:










Graph Neural Networks (GNNs)

[back to contents]

Graph Neural Networks (GNNs) are a powerful tool for (elegantly) incorporating user/item interaction data (collaborative filtering), user attribute data (demographic filtering), and item attribute data (content-based filtering) within a single model.

They are particularly powerful for recommendation tasks since the relationships between users and items are often well-captured by a graph representation.

Examples of Different Graph Representations. Image source: author

GNNs can also model heterogeneous relationships (link types) between nodes (e.g. an edge between a user node and item node within the same graph could represent either a “view”, a “click” or a “purchase”).

GNNs learn a unique \(d^{(K)}\)-dimensional real-valued vector representation (embedding) of each node in a given graph. These node embeddings can then be used directly as model features (inputs) in downstream modelling tasks (e.g. edge/link prediction).

The node embeddings are learned using an encoder/decoder architecture:

Image Source: Hamilton (2022)

Here is a general description of a typical structure for the encoder model:

  1. Each node \(u\) is initialized with a real-valued vector representation \(\mathbf{x}_u=\mathbf{h}_u^{(0)}\) (containing the node’s attribute data)

  2. The vector representation \(\mathbf{h}_u^{(0)}\) of each node \(u\) is updated by combining it with an aggregation of the vector representations of its immediate neighbours (this is called message passing):

\[\begin{array}{lcl} \underset{\text{vector representation }\\\text{of node } u \text{ at iteration } \\k+1}{\underbrace{h_u^{(k+1)}}} &=& \underset{\text{some chosen}\\\text{function}\\\text{(e.g. neural net)}}{\underbrace{\text{UPDATE}}}\Big(\underset{\text{vector representation}\\\text{of node } u \text{ at}\\\text{iteration } k}{\underbrace{\mathbf{h}_u^{(k)}}}, \underset{\text{some aggregation of the vector}\\\\\text{representations (at iteration } k \text{)}\\\text{of the nodes linked to } u \\\text{ (i.e. combined information from}\\u\text{'s immediate 1-hop neighbours) }}{\underbrace{\mathbf{m}^{(k)}_{\mathcal{N}(u)}}}\Big) \\ \mathbf{m}^{(k)}_{\mathcal{N}(u)} &=& \underset{\text{some chosen}\\\text{function}\\\text{(e.g. sum)}}{\underbrace{\text{AGGREGATE}}}\bigg(\underset{\text{set of vector representations}\\\text{of all nodes neighbouring}\\\text{(directly linked to) node } u \\\text{(at iteration }k\text{)}}{\underbrace{\Big\{\mathbf{h}_v^{(k)},\forall v \in \mathcal{N}(u)\Big\}}}\bigg)\\ \end{array}\]

  3. Step (2) (previous update step) is (potentially) repeated multiple times. Since each update step passes information between immediate neighbours, multiple update steps result in nodes incorporating information from their more distant neighbours.

Two Layer Message-Passing Example. Image source: Hamilton (2022) (with minor modifications)

  4. After \(K\) update steps, the resulting vector representation \(\mathbf{z}_u=\mathbf{h}_u^{(K)}\) is a numeric embedding containing information both about the node \(u\) itself and about the local structure of the graph around the node \(u\).

The choices of the \(\text{UPDATE()}\) and \(\text{AGGREGATE()}\) functions define the encoder architecture. Very many different ones have been proposed (refer to e.g. Hamilton (2022)).

Here is a basic example (from Hamilton (2022)):

\[\begin{array}{lcl} \mathbf{h}_u^{(k+1)} &=& \overset{\text{UPDATE()}}{\overbrace{\sigma\Bigg(\mathbf{W}_{self}^{(k+1)}\mathbf{h}_u^{(k)}+\mathbf{W}_{neigh}^{(k+1)}\underset{\text{AGGREGATE()}}{\Big(\underbrace{\displaystyle\sum_{v\in \mathcal{N}(u)}\mathbf{h}_v^{(k)}}\Big)} + \mathbf{b}^{(k+1)}\Bigg)}} \\ \sigma() &=& \text{element-wise non-linear function ('activation function') such as tanh, sigmoid, ReLU etc.} \\ \mathbf{W}_{self}^{(k+1)}, \mathbf{W}_{neigh}^{(k+1)} &\in& \mathbb{R}^{d^{(k+1)}\times d^{(k)}} \space \text{are matrices of trainable parameters (weights)}\\ \mathbf{b}^{(k+1)} &\in& \mathbb{R}^{d^{(k+1)}} \text{ is a vector of trainable parameters (weights)} \\ d^{(k+1)} &=& \text{dimension of vector representation (embedding) at iteration } k+1 \\ \end{array}\]

\[\overset{\mathbf{W}^{(k+1)}}{\begin{bmatrix}\cdot&\cdot&\cdot&\cdot\\\cdot&\cdot&\cdot&\cdot\end{bmatrix}} \overset{\mathbf{h}_*^{(k)}}{\begin{bmatrix}\cdot\\\cdot\\\cdot\\\cdot\end{bmatrix}} \quad=\quad \overset{\mathbf{h}_*^{(k+1)}}{\begin{bmatrix}\cdot\\\cdot\end{bmatrix}}\]
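The basic encoder above can be sketched in NumPy as follows. This is a forward pass only, with untrained random weights, intended to illustrate the message-passing mechanics rather than a trainable model. The function names, the adjacency-matrix representation of \(\mathcal{N}(u)\), the choice of tanh for \(\sigma()\) and the toy 3-node path graph are assumptions of this sketch.

```python
import numpy as np

def message_passing_layer(H, A, W_self, W_neigh, b):
    """One update step for all nodes at once.

    H : (n_nodes x d_k) matrix whose row u is h_u^(k)
    A : (n_nodes x n_nodes) adjacency matrix, A[u, v] = 1 if v is a neighbour of u
    """
    M = A @ H  # AGGREGATE: row u is the sum of h_v over the neighbours v of u
    return np.tanh(H @ W_self.T + M @ W_neigh.T + b)  # UPDATE

def encode(X, A, layers):
    """Apply K message-passing layers; returns z_u = h_u^(K) for every node."""
    H = X
    for W_self, W_neigh, b in layers:
        H = message_passing_layer(H, A, W_self, W_neigh, b)
    return H

rng = np.random.default_rng(0)
# path graph: node 0 -- node 1 -- node 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = rng.normal(size=(3, 4))  # initial node attributes, d^(0) = 4
dims = [4, 3, 2]             # d^(0), d^(1), d^(2)
layers = [
    (rng.normal(scale=0.3, size=(d_out, d_in)),   # W_self
     rng.normal(scale=0.3, size=(d_out, d_in)),   # W_neigh
     np.zeros(d_out))                             # b
    for d_in, d_out in zip(dims[:-1], dims[1:])
]
Z = encode(X, A, layers)     # (3 x 2) matrix of node embeddings
```

With this graph, node 0 only "sees" node 2 after the second layer, since node 2 is two hops away. This is the multi-hop behaviour described in step (3).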

See also: Graphs can be used in order to directly model the similarity between users (or between items) for direct use in a collaborative filtering model - refer to Graph-Based Collaborative Filtering.










Tradeoffs Between Various Recommendation Algorithms

[back to contents]

| algorithm | can incorporate user/item interaction data | can incorporate user attribute data | can incorporate item attribute data | can incorporate recommendation context data | training time | strengths | weaknesses | explainable recommendations |
|---|---|---|---|---|---|---|---|---|
| Association Rules | ? | ? | ? | ? | ? | ? | ? | ? |
| Content-Based Filtering | ? | ? | ? | ? | ? | ? | ? | ? |
| Factorization Machine | yes | yes | yes | yes | fast | ? | ? | ? |
| Feature-Weighted Linear Stacking | ? | ? | ? | ? | ? | ? | ? | ? |
| Graph Neural Network (GNN) | yes | yes | yes | no | medium | ? | ? | no |
| Matrix Factorization (Latent Factor Model) | yes | no | no | no | fast | ? | ? | no |
| Naive Bayes Collaborative Filtering | ? | ? | ? | ? | ? | ? | ? | ? |
| Neighbour-Based Collaborative Filtering: User-User and/or Item-Item | ? | ? | ? | ? | ? | ? | ? | ? |
| Neighbour-Based Collaborative Filtering: Graph-Based | ? | ? | ? | ? | ? | ? | ? | ? |
| Two Tower Model | yes, if user ID embedding and item ID embedding are included in the model | yes | yes | yes | medium | alleviates cold start problem | ? | no |










Factorization Machines

[back to contents]

Factorization Machines (Rendle (2010)) are simply linear regression models with interactions, but in which the model coefficients on the interaction terms are modelled using a latent factor model. This factorization helps to avoid overfitting and improve generalization, especially when modelling sparse data. This is achieved by breaking the independence between the interaction coefficients (i.e. allowing/forcing information sharing between them).

image source: Rendle (2010)

The basic model is defined:

\[\begin{array}{lcl} \hat{y}(\overrightarrow{\mathbf{x}}) &:=& \underset{\text{standard linear regression}}{\underbrace{w_0 + \displaystyle{\sum_{i=1}^p}w_ix_i}} + \underset{\text{latent factor model of all 2-way interactions}}{\underbrace{\displaystyle{\sum_{i=1}^p\sum_{j=i+1}^p\langle\overrightarrow{\mathbf{v}}_i,\overrightarrow{\mathbf{v}}_j\rangle}x_ix_j}} \\ \space &\space& w_0\in \mathbb{R}, \quad\overrightarrow{\mathbf{w}} \in \mathbb{R}^{p}, \quad \mathbf{V} \in \mathbb{R}^{p \times k} \\ p &=& \text{number of features} \\ k &=& \text{dimension of latent space} \\ \overrightarrow{\mathbf{v}}_i &=& \text{row } i \text{ of } \mathbf{V} \text{ (latent factor vector of feature } i \text{)} \\ \end{array}\]
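The model equation can be evaluated directly with a double loop over feature pairs, or in linear time using the algebraic identity given in Rendle (2010): \(\sum_{i<j}\langle \mathbf{v}_i,\mathbf{v}_j\rangle x_ix_j = \frac{1}{2}\sum_{f=1}^k\Big[\big(\sum_i v_{if}x_i\big)^2-\sum_i v_{if}^2x_i^2\Big]\). The sketch below implements both so that they can be compared. The function names and the random (untrained) parameters are assumptions for illustration only.

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Direct evaluation: O(k * p^2) over all feature pairs i < j."""
    p = len(x)
    y = w0 + w @ x
    for i in range(p):
        for j in range(i + 1, p):
            y += (V[i] @ V[j]) * x[i] * x[j]  # <v_i, v_j> * x_i * x_j
    return y

def fm_predict_fast(x, w0, w, V):
    """Equivalent O(k * p) evaluation using the identity from Rendle (2010)."""
    sum_then_square = (V.T @ x) ** 2          # (sum_i v_if x_i)^2, shape (k,)
    square_then_sum = (V ** 2).T @ (x ** 2)   # sum_i v_if^2 x_i^2, shape (k,)
    return w0 + w @ x + 0.5 * np.sum(sum_then_square - square_then_sum)

rng = np.random.default_rng(0)
p, k = 6, 3                       # number of features, latent dimension
w0, w, V = 0.5, rng.normal(size=p), rng.normal(size=(p, k))
x = rng.normal(size=p)            # one feature vector

y_naive = fm_predict_naive(x, w0, w, V)
y_fast = fm_predict_fast(x, w0, w, V)
```

Note that the model has only \(p \times k\) interaction parameters (the matrix \(\mathbf{V}\)) rather than one free coefficient per feature pair. This is the information sharing between interaction coefficients described above.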










Incorporating Context

[back to contents]

In certain domains, the context (e.g. time, season, user location, user device etc.) in which the recommendation is delivered can have a material effect on the relevance of the recommendation. Obvious examples are the season (e.g. winter) in which clothing items are being recommended, or the location of the user for a restaurant recommendation.

The Multidimensional Ratings Cube (source: Aggarwal (2016))

Aggarwal (2016) describes 3 broad approaches:

  1. Contextual Pre-Filtering: A separate recommendation model is built for each unique context (i.e. on a 2-Dimensional user/item “slice” of the ratings hypercube).

  2. Contextual Post-Filtering: A global model (which ignores all contextual information) is built. The predictions from this model are then adjusted using contextual information. A very simple example of this is to build a recommendation model on all items, but then only allow winter clothing to be recommended in winter.

  3. Contextual Modelling: Contextual information is explicitly used within the architecture of the recommendation model.

Incorporating Context: Contextual Pre-Filtering

In contextual pre-filtering, a separate standard recommendation model is built for each unique context (i.e. on a 2-Dimensional user/item “slice” of the ratings hypercube).

For example, if the contextual dimensions were

  1. user location

  2. time of day

..then an independent model would be built for every unique user location/time of day combination, with each individual model trained on only the observed ratings within that user location/time of day slice of the data.

This technique increases relevance at the cost of increased data sparsity (each model sees only a fraction of the observed ratings), which raises variance and the risk of overfitting. The granularity of the context segment (slice of the data) can be adjusted in order to control the balance between relevance and sparsity. For example, one could model each of the time contexts as an hourly slice:

\[\{\text{8pm, 9pm, 10pm, ...}\}\]

..or rather reduce the granularity of this context to 3-hourly slices:

\[\{\text{"1pm-3pm", "4pm-6pm", "7pm-9pm", ...}\}\]

This granularity decision can be:

  1. Heuristic: e.g. using the finest granularity containing at least \(n\) observed ratings

  2. Ensemble-Based: model multiple different granularities using separate models and then combine their predictions into a single prediction.

  3. Decided using Cross-Validation: find the optimal granularity on a chosen metric using holdout data
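The heuristic option (1) can be sketched as follows for a single "hour of day" context dimension. The function name `choose_time_slice`, the candidate slice widths and the toy data are hypothetical. The sketch picks the narrowest time slice around the target hour that still contains at least `min_ratings` observed ratings. The pre-filtered model would then be trained only on the ratings falling within the returned slice.

```python
def choose_time_slice(rating_hours, target_hour, min_ratings, widths=(1, 3, 6, 12, 24)):
    """Return (start_hour, end_hour) of the finest slice containing target_hour
    that holds at least min_ratings observed ratings.

    rating_hours : hour of day (0-23) at which each observed rating was made
    widths       : candidate slice widths in hours, ordered finest first
    """
    for width in widths:
        start = (target_hour // width) * width
        n_in_slice = sum(start <= h < start + width for h in rating_hours)
        if n_in_slice >= min_ratings:
            return start, start + width
    return 0, 24  # fall back to ignoring the time context entirely

rating_hours = [20, 20, 21, 22, 9]
fine_slice = choose_time_slice(rating_hours, target_hour=20, min_ratings=2)
coarse_slice = choose_time_slice(rating_hours, target_hour=20, min_ratings=3)
no_context = choose_time_slice(rating_hours, target_hour=20, min_ratings=10)
```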

Incorporating Context: Contextual Post-Filtering

In contextual post-filtering, a global recommendation model (which ignores all contextual information) is built. The predictions from this global model are then adjusted using contextual information.

Aggarwal (2016) describes 2 general approaches:

  1. Heuristic: Generate predictions for all user/item pairs using the global model, and then simply screen out items which are irrelevant to the recommendation context at prediction time (e.g. show the user the highest ranked winter items during winter).

  2. Model-based: \[\begin{array}{lcl} \hat{r}_{ijc} &=& \text{predicted rating of item } j \text{ by user } i \text{ in context } c \\ &=& \overset{\text{local model}}{\overbrace{p(i,j,c)}} \quad \times \overset{\text{global model}}{\overbrace{\quad\hat{r}_{ij}\quad}} \\ p(i,j,c) &=& \text{predicted relevance of item } j \text{ to user } i \text{ in context } c \\ \hat{r}_{ij} &=& \text{predicted rating of item } j \text{ by user } i \text{ (using global model trained without context data)} \\ \end{array}\]

Note that \(p(i,j,c)\) could alternatively be replaced by a model \(p(j,c)\) which does not consider the user.
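Both post-filtering approaches amount to multiplying the global predictions by a context-specific relevance weight: a 0/1 mask in the heuristic case, and a predicted relevance \(p(j,c)\) in the model-based case. Here is a minimal sketch for a single user. The item names, scores and relevance values are made up for illustration, and the function name `post_filter` is hypothetical.

```python
import numpy as np

def post_filter(global_scores, relevance):
    """Adjust global predictions r_ij by the contextual relevance of each item."""
    return np.asarray(relevance, dtype=float) * np.asarray(global_scores, dtype=float)

items = ["coat", "sandals", "scarf"]
global_scores = [3.5, 4.8, 4.0]   # r_ij from a global model that ignores context

# 1. heuristic: screen out items that are irrelevant in the context "winter"
is_winter_item = [1, 0, 1]
heuristic_scores = post_filter(global_scores, is_winter_item)

# 2. model-based: p(j, c) is the predicted relevance of item j in context "winter"
p_winter = [0.9, 0.1, 0.8]
model_scores = post_filter(global_scores, p_winter)

ranked = [items[k] for k in np.argsort(-model_scores)]
```

Note that "sandals" has the highest global score but drops to the bottom of the ranking once the winter context is applied.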

Incorporating Context: Contextual Modelling

A contextual model explicitly incorporates the recommendation context data into the architecture of the model itself.

Some examples are:

Incorporating Context: Contextual Modelling: Contextual Latent Factor Models

Explanation here TODO!

One example is Pairwise Interaction Tensor Factorization (TODO: ref), which is defined:

(source: Aggarwal (2016))

\[\begin{array}{lcl} \hat{r}_{ijc} &=& \text{predicted rating of item } j \text{ by user } i \text{ in context } c \\ &=& (\mathbf{U}\mathbf{V}^T)_{ij} + (\mathbf{V}\mathbf{W}^T)_{jc} + (\mathbf{U}\mathbf{W}^T)_{ic} \\ &=& \displaystyle\sum_{s=1}^k \Big(u_{is}v_{js} + v_{js}w_{cs} + u_{is}w_{cs}\Big) \\ m &=& \text{number of unique users} \\ n &=& \text{number of unique items} \\ d &=& \text{number of unique contexts} \\ k &=& \text{dimension of latent space} \\ \mathbf{U} &\in& \mathbb{R}^{m\times k} \quad \text{(matrix of user factors)} \\ \mathbf{V} &\in& \mathbb{R}^{n\times k} \quad \text{(matrix of item factors)} \\ \mathbf{W} &\in& \mathbb{R}^{d\times k} \quad \text{(matrix of context factors)} \\ \end{array}\]
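The prediction rule is easy to express in NumPy. The sketch below takes \(\mathbf{U}\) as the user factors (\(m \times k\)), \(\mathbf{V}\) as the item factors (\(n \times k\)) and \(\mathbf{W}\) as the context factors (\(d \times k\)), which is what the index pattern \((\mathbf{U}\mathbf{V}^T)_{ij} + (\mathbf{V}\mathbf{W}^T)_{jc} + (\mathbf{U}\mathbf{W}^T)_{ic}\) implies. The factor matrices here are random rather than learned, and the function names are hypothetical.

```python
import numpy as np

def pitf_predict(U, V, W, i, j, c):
    """Predicted rating of item j by user i in context c."""
    return U[i] @ V[j] + V[j] @ W[c] + U[i] @ W[c]

def pitf_predict_cube(U, V, W):
    """Full (m x n x d) cube of predictions, built with broadcasting."""
    user_item = (U @ V.T)[:, :, None]      # (m, n, 1)
    item_context = (V @ W.T)[None, :, :]   # (1, n, d)
    user_context = (U @ W.T)[:, None, :]   # (m, 1, d)
    return user_item + item_context + user_context

rng = np.random.default_rng(0)
m, n, d, k = 4, 5, 3, 2   # users, items, contexts, latent dimension
U = rng.normal(size=(m, k))
V = rng.normal(size=(n, k))
W = rng.normal(size=(d, k))

R_hat = pitf_predict_cube(U, V, W)
r_hat_123 = pitf_predict(U, V, W, i=1, j=2, c=0)
```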

Incorporating Context: Contextual Modelling: Contextual Neighbourhood-Based Models

TODO










Session-Based Recommendation

[back to contents]

S. Wang et al. (2022)










Wide & Deep Model

[back to contents]

Inspired by the observation that shallow models (like linear models) tend to be better at memorization (finding specific reoccurring patterns in data) while deep models tend to be better at generalization (approximating the structure of the underlying data-generating system), the Wide and Deep (Cheng et al. 2016) model is an architecture containing both a wide component and a deep component - an attempt to combine the specific strengths of these 2 different models in a unified framework.

Google evaluated this model for app recommendation on their app store Google Play.

image source: author

More specifically:










Deep & Cross Model

[back to contents]

The Deep & Cross (R. Wang et al. 2021) model is a modification of the Wide & Deep model architecture, but designed to learn explicit low-level feature interaction terms in an automated way.

It accomplishes this using cross layers (see illustration below). Stacking multiple cross layers in sequence results in increasing degrees of feature interaction (1 cross layer gives pairwise interactions such as \(ab\), 2 cross layers add three-way interactions such as \(abc\), etc.). Polynomial terms are also created (\(a^3\), \(a^2b\) etc.).

The Structure of a Cross Layer. Image Source: R. Wang et al. (2021)

For a deeper understanding of the behaviour of this cross layer, refer to this resource.
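To make the growth in interaction order concrete, here is a small NumPy sketch of a cross layer, assuming the form \(\mathbf{x}_{l+1} = \mathbf{x}_0 \odot (\mathbf{W}_l\mathbf{x}_l + \mathbf{b}_l) + \mathbf{x}_l\) from R. Wang et al. (2021). The weight matrix is hand-picked (not learned) so that the resulting terms can be read off by hand, and the function name `cross_layer` is an assumption of this sketch.

```python
import numpy as np

def cross_layer(x0, xl, W, b):
    """x_{l+1} = x0 * (W @ xl + b) + xl   (element-wise product with x0)."""
    return x0 * (W @ xl + b) + xl

a, b_val = 2.0, 3.0
x0 = np.array([a, b_val])          # input features (a, b)
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])         # hand-picked: swaps the two coordinates
bias = np.zeros(2)

x1 = cross_layer(x0, x0, W, bias)  # (ab + a, ab + b): 2nd degree terms appear
x2 = cross_layer(x0, x1, W, bias)  # (a^2 b + 2ab + a, a b^2 + 2ab + b): 3rd degree terms
```

Each additional layer multiplies the previous output by the original input \(\mathbf{x}_0\) once more, which is why the maximum polynomial degree grows by one per layer. The residual term \(+\,\mathbf{x}_l\) keeps the lower-order terms available.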

Architectures explanation:

Image source: R. Wang et al. (2021)

STILL IN PROGRESS: Here is a PyTorch (python) implementation of the Deep & Cross model










Two Tower Model

[back to contents]

The Two Tower model is a natural extension of the Matrix Factorization (Latent Factor) model, additionally incorporating user features and item features into the architecture (which helps to alleviate the cold start problem inherent in collaborative filtering models in general).

The architecture consists of 2 distinct and parallel models, the parameters of which are optimized simultaneously during model training:

  1. a user-encoder (which outputs a user embedding)

  2. an item-encoder (which outputs an item embedding)

The outputs of the 2 encoders are then combined to produce a single model prediction.

image source: author

Note that in order to be an extension of the matrix factorization (latent factor) model, the two tower model must learn a latent representation (embedding) of the user (concatenated with the input user features) and a latent representation (embedding) of the item (concatenated with the input item features). The model will still work without these latent factors, but is no longer a straightforward extension of the matrix factorization (latent factor) model (it no longer explicitly contains the user/item interaction information).
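The forward pass of this architecture can be sketched in NumPy as below. This is an untrained illustration only: the layer sizes, the random weights, the ReLU feedforward towers and all function and variable names are assumptions of this sketch (in practice the ID embeddings and tower weights would be learned jointly, e.g. in TensorFlow or PyTorch).

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layers(sizes):
    """Random (weight, bias) pairs for a feedforward net with the given layer sizes."""
    return [(rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def mlp(x, layers):
    """Feedforward pass: ReLU on hidden layers, linear output layer."""
    for W, b in layers[:-1]:
        x = np.maximum(0.0, W @ x + b)
    W, b = layers[-1]
    return W @ x + b

n_users, n_items = 5, 7
id_dim, n_user_feats, n_item_feats, out_dim = 4, 3, 2, 8

# latent ID embeddings: these carry the user/item interaction information
user_id_emb = rng.normal(size=(n_users, id_dim))
item_id_emb = rng.normal(size=(n_items, id_dim))

user_tower = init_layers([id_dim + n_user_feats, 16, out_dim])
item_tower = init_layers([id_dim + n_item_feats, 16, out_dim])

def two_tower_score(u, j, user_feats, item_feats):
    """Affinity of user u for item j: dot product of the two tower outputs."""
    user_vec = mlp(np.concatenate([user_id_emb[u], user_feats]), user_tower)
    item_vec = mlp(np.concatenate([item_id_emb[j], item_feats]), item_tower)
    return float(user_vec @ item_vec)

score = two_tower_score(u=0, j=3,
                        user_feats=np.array([1.0, 0.0, 0.5]),
                        item_feats=np.array([0.2, 1.0]))
```

If the towers are reduced to the identity function and the feature inputs are dropped, the score collapses to the dot product of the two ID embeddings, which is the plain matrix factorization model.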

Although I couldn’t find any mention of it in the literature, incorporating context data into the two tower model by including a third tower (a context encoder) seems like a logical extension of the architecture. In this case, the dot product operation (which combines the outputs of the towers into a single prediction) would need to be replaced by another operation (such as a feedforward neural network).

Here is a python implementation of the 2-Tower model using TensorFlow










Integrating Latent Factor Models with Arbitrary Models

[back to contents]

Latent factor models can be integrated with another recommendation architecture (i.e. be included as a module within the overall architecture, where the latent factors are learned simultaneously with the other model parameters) using a simple linear combination:

\[\begin{array}{lcl} \hat{r}_{ijl} &=& \underset{\text{main effects:}\\\text{user generosity &}\\\text{item popularity}}{\underbrace{\quad o_i + p_j \quad}} + \underset{\text{matrix factorization/}\\\text{latent factor model}}{\underbrace{\quad \overrightarrow{u}_i \cdot \overrightarrow{v}_j \quad}} + \beta \space f\bigg(\underset{\text{content model}}{\underbrace{\overrightarrow{c}_i^{(user)}, \overrightarrow{c}_j^{(item)}}}, \underset{\text{collaborative model}}{\underbrace{\overrightarrow{r}_i^{(user)}, \overrightarrow{r}_j^{(item)}}}, \underset{\text{context}\\\text{info.}}{\underbrace{\overset{\rightarrow{}}{\hspace{5mm}z_l\hspace{5mm}}}} \bigg)\\ \space &\space& \space \\ \hat{r}_{ijl} &=& \text{predicted rating of item } j \text{ by user } i \text{ in context } l\\ o_i &=& \text{user } i \text{ bias (e.g. user who rates all items highly)} \\ p_j &=& \text{item } j \text{ bias (e.g. item that all users rate highly)} \\ \overrightarrow{u}_i &=& \text{latent factor representation of user } i\\ \overrightarrow{v}_j &=& \text{latent factor representation of item } j \\ \overset{\rightarrow{}}{z} &=& \text{recommendation context information (vector)}\\ \beta &=& \text{adjustment/weight of non-latent portion of model} \\ f() &=& \text{any chosen function (e.g. linear regression)} \\ \overrightarrow{c}_i^{(user)} &=& \text{user } i \text{ attribute/content vector} \\ \overrightarrow{c}_j^{(item)} &=& \text{item } j \text{ attribute/content vector} \\ \overrightarrow{r}_i &=& \text{user } i \text{ observed ratings vector (row of user/item ratings matrix)} \\ \overrightarrow{r}_j &=& \text{item } j \text{ observed ratings vector (column of user/item ratings matrix)} \\ \end{array}\]

Some possible choices of function \(f()\) are:










References

[back to contents]

Aggarwal, Charu C. 2016. Recommender Systems - the Textbook. Springer.
Armstrong, Robert. 2008. The Long Tail: Why the Future of Business Is Selling Less of More. Canadian Journal of Communication. Vol. 33. https://doi.org/10.22230/cjc.2008v33n1a1946.
Bastani, Hamsa, Mohsen Bayati, and Khashayar Khosravi. 2017. Mostly Exploration-Free Algorithms for Contextual Bandits. arXiv. https://doi.org/10.48550/ARXIV.1704.09011.
Cheng, Heng-Tze, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, et al. 2016. Wide & Deep Learning for Recommender Systems. arXiv. https://doi.org/10.48550/ARXIV.1606.07792.
Divina, Federico, Aude Gilson, Francisco Gómez-Vela, Miguel Garcia Torres, and José Torres. 2018. Stacking Ensemble Learning for Short-Term Electricity Consumption Forecasting. Energies. Vol. 11. https://doi.org/10.3390/en11040949.
Foster, Dylan J., Alekh Agarwal, Miroslav Dudík, Haipeng Luo, and Robert E. Schapire. 2018. Practical Contextual Bandits with Regression Oracles. arXiv. https://doi.org/10.48550/ARXIV.1803.01088.
Guo, Huifeng, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine Based Neural Network for CTR Prediction. arXiv. https://doi.org/10.48550/ARXIV.1703.04247.
Hamilton, William L. 2022. Graph Representation Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Vol. 14. Morgan & Claypool.
Jannach, Dietmar. 2022. Multi-Objective Recommender Systems: Survey and Challenges. arXiv. https://doi.org/10.48550/ARXIV.2210.10309.
Kaminskas, Marius, and Derek Bridge. 2016. Diversity, Serendipity, Novelty, and Coverage: A Survey and Empirical Analysis of Beyond-Accuracy Objectives in Recommender Systems. ACM Trans. Interact. Intell. Syst. Vol. 7. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2926720.
Li, Lihong, Wei Chu, John Langford, and Robert E. Schapire. 2010. A Contextual-Bandit Approach to Personalized News Article Recommendation. ACM Press. https://doi.org/10.1145/1772690.1772758.
Li, Xiangyang, Bo Chen, Huifeng Guo, Jingjie Li, Chenxu Zhu, Xiang Long, Sujian Li, et al. 2022. IntTower: The Next Generation of Two-Tower Model for Pre-Ranking System. CIKM ’22. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3511808.3557072.
Rendle, Steffen. 2010. Factorization Machines. 2010 IEEE International Conference on Data Mining.
Sill, Joseph, Gábor Takács, Lester W. Mackey, and David Lin. 2009. Feature-Weighted Linear Stacking. CoRR. Vol. abs/0911.0460. http://arxiv.org/abs/0911.0460.
Wang, Ruoxi, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi. 2021. DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-Scale Learning to Rank Systems. ACM. https://doi.org/10.1145/3442381.3450078.
Wang, Shoujin, Qi Zhang, Liang Hu, Xiuzhen Zhang, Yan Wang, and Charu Aggarwal. 2022. Sequential/Session-Based Recommendations. ACM. https://doi.org/10.1145/3477495.3532685.